Abstract

Bootstrap techniques (also called resampling computation techniques) have
introduced new advances in modeling and model evaluation [10]. Using resampling
methods to construct a series of new samples based on the original data set
makes it possible to estimate the stability of the parameters. Properties such as
convergence and asymptotic normality can be checked for any particular observed data
set. In most cases, the statistics computed on the generated data sets give a good
idea of the confidence regions of the estimates. In this paper, we discuss the
contribution of such methods to model selection in the case of feedforward neural
networks. The method is described and compared with the leave-one-out resampling
method. The effectiveness of the bootstrap method, versus the leave-one-out
method, is checked through a number of examples.
1 Multilayer Perceptrons (MLP)

Suppose we have a set of $n$ independent observations of a continuous variable $y$ that
we have to explain from a set of $p$ explanatory variables $(x_1, x_2, \ldots, x_p)$. We
want to use the non-linear models called Multilayer Perceptrons. These models are
nowadays commonly used for non-linear regression, forecasting and pattern recognition,
and are particular examples of artificial neural networks. In such a network, units are
organized in successive layers with links connecting one layer to the following one.
See Cheng and Titterington or Hertz et al. [8] for details or references.

We consider in the following a multilayer perceptron (MLP) with $p$ inputs, one
hidden layer with $H$ hidden units and one output layer. The model can be expressed
analytically in the following form; the output $y$ is given by:

$$ y = \phi_0\left( w_0 + \sum_{h=1}^{H} w_h \, \phi\left( b_h + \sum_{j=1}^{p} w_{jh} x_j \right) \right) + \varepsilon \tag{1} $$

where $\varepsilon$ is the residual term, with zero mean and variance $\sigma^2$ (with
normal distribution or not),

$y$ is a continuous variable,

$\phi_0$ is the identity output function,

$\phi$ is (in most cases) the sigmoid:

$$ \phi(x) = \frac{1}{1 + \exp(-x)} $$

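Let $\theta = (w_0, w_1, \ldots, w_H, w_{11}, \ldots, w_{pH})$ be the parameter vector of
the network and let $y(x; \theta)$ be the computed value for an input
$x = (x_1, \ldots, x_p)$ and a parameter vector $\theta$. There are $H(p + 1) + H + 1$
parameters to estimate (the $H(p + 1)$ weights and biases of the hidden layer, the $H$
hidden-to-output weights and the constant $w_0$).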
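
As an illustration, the forward computation of Eq. (1) and the parameter count can be sketched in Python with NumPy. This is only a sketch under stated assumptions: the function names `sigmoid`, `mlp_output` and `n_parameters`, the array layout chosen for the weights and the random values used in the call are ours, not part of the original work.

```python
import numpy as np

def sigmoid(z):
    """Logistic transfer function phi(z) = 1 / (1 + exp(-z))."""
    return 1.0 / (1.0 + np.exp(-z))

def mlp_output(x, w0, w, b, W):
    """Deterministic part of Eq. (1), i.e. without the residual term.

    x  : (n, p) array of inputs
    w0 : scalar constant of the output unit
    w  : (H,) hidden-to-output weights w_h
    b  : (H,) hidden biases b_h
    W  : (p, H) input-to-hidden weights w_jh (layout is an assumption)
    """
    hidden = sigmoid(b + x @ W)   # (n, H) hidden unit activations
    return w0 + hidden @ w        # identity output function phi_0

def n_parameters(p, H):
    """Number of parameters of the network: H(p + 1) + H + 1."""
    return H * (p + 1) + H + 1

# Hypothetical call with p = 2 inputs and H = 4 hidden units.
p, H = 2, 4
rng = np.random.default_rng(0)
x = rng.normal(size=(5, p))
y_hat = mlp_output(x, 0.5, rng.normal(size=H), rng.normal(size=H),
                   rng.normal(size=(p, H)))
print(y_hat.shape, n_parameters(p, H))
```

For $p = 2$ and $H = 4$ the count is $4 \times 3 + 4 + 1 = 17$.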

Classically, if the data are numerous, the first step consists in dividing the
supplied data into two sets: a training set and a test set. The so-called training set

$$ \{(x_1; y_1), \ldots, (x_m; y_m)\}, \quad 1 \le i \le m, \; m < n, $$

is used to estimate the weights of the model by minimizing an error function:



$$ \frac{1}{m} \sum_{i=1}^{m} \left( y_i - y(x_i; \theta) \right)^2 $$

using optimization techniques such as gradient descent, conjugate gradient or quasi- 
Newton methods. 

The resulting least squares estimator of $\theta$ is denoted by $\hat{\theta}$, and the
resulting lack of fit for the training set is the learning error:

$$ MSE_a = \frac{1}{m} \sum_{i=1}^{m} \left( y_i - y(x_i; \hat{\theta}) \right)^2 . \tag{2} $$



The training set is used to derive the parameters of the model and the resulting 
model is tested on the test set. A good regression method would generalize well on 
examples that have not been seen before, by learning the underlying function without 
the associated noise. The test error can be defined by : 

$$ MSE_t = \frac{1}{n - m} \sum_{i=m+1}^{n} \left( y_i - y(x_i; \hat{\theta}) \right)^2 . \tag{3} $$
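
The learning error and the test error can be illustrated numerically as follows. This is a minimal sketch and not the procedure used in the paper: to keep it short, a linear least-squares fit stands in for the training of an MLP, and the simulated data, the sizes `n` and `m` and the helper `mse` are hypothetical.

```python
import numpy as np

def mse(y, y_hat):
    """Mean of the squared residuals."""
    return float(np.mean((y - y_hat) ** 2))

# Hypothetical data: n observations, the first m of which form the training set.
rng = np.random.default_rng(1)
n, m = 200, 150
x = rng.normal(size=(n, 2))
y = 2.0 + 0.7 * x[:, 0] + 0.5 * x[:, 1] + rng.normal(scale=0.5, size=n)

# Stand-in for MLP training: ordinary least squares on the training set only.
X = np.column_stack([np.ones(n), x])
theta_hat, *_ = np.linalg.lstsq(X[:m], y[:m], rcond=None)

mse_a = mse(y[:m], X[:m] @ theta_hat)   # learning error on the m training points
mse_t = mse(y[m:], X[m:] @ theta_hat)   # test error on the n - m remaining points
print(f"MSE_a = {mse_a:.4f}, MSE_t = {mse_t:.4f}")
```

The learning error is computed on the data used for the fit, so it tends to be optimistic; the test error is computed on the $n - m$ observations that the fit has not seen.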



Most optimization techniques (which are variants of gradient methods) provide local
minima of the error function and not a global one. In practice, different learning
conditions (initialization of the weights, learning adaptation parameter, sequential
order in the sample presentation, ...) give different solutions that are difficult to
compare. It is not easy to know whether a minimum has been reached, because the
decrease of the error function is slow, an over-learning phenomenon can occur, etc.
For these reasons, numerous stopping and validation techniques have been proposed; see
for example Borowiak [1] or Hertz et al. [8].



For multilayer perceptrons, the choice of a model is equivalent to the choice of
the architecture of the network. If one has to select a model among many candidates,
an exhaustive (but unrealistic) method would consist in exploring the whole set of
possible models and testing all of them on the given problem. The estimation
of the performances is then a crucial point, all the more so since many factors
complicate this evaluation. It is necessary to be certain that convergence
has occurred, and to have at one's disposal a good quality criterion which makes it
possible to decide which model is the best. Since it is in fact impossible to try all
the possible models, the bootstrap method can be very useful.

2 Bootstrap for parameter estimation 

Bootstrap techniques were introduced by Efron [5] and are simulation techniques based
on the empirical distribution of the observed sample. Let $x = (x_1, \ldots, x_n)$ be an
$n$-sample with an unknown distribution function $T$, depending on an unknown real
parameter $\theta$. The problem consists in estimating this parameter by a statistic
$\hat{\theta} = s(x)$ from the sample $x$ and in evaluating the accuracy of the estimate,
although the distribution $T$ is unknown. In order to evaluate this accuracy, $B$
samples are built from the initial sample $x$ by re-sampling. These samples are called
bootstrapped samples and denoted by $x^{*b}$.

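A bootstrapped sample $x^{*b} = (x_1^{*b}, \ldots, x_n^{*b})$ is built by random drawing
(with repetitions) from the initial sample $x$:

$$ P_U(x_i^{*b} = x_j) = \frac{1}{n}, \quad i, j = 1, \ldots, n, $$

where $P_U$ is the uniform distribution on the original data set $x = (x_1, \ldots, x_n)$.
The distribution function of a bootstrapped sample $x^{*b}$ is $\hat{T}$, i.e. the
empirical distribution of $x$. A bootstrap replicate of the estimator
$\hat{\theta} = s(x)$ will be $\hat{\theta}^{*b} = s(x^{*b})$. For example, for the mean of
the sample $x$, the estimator is $s(x) = \frac{1}{n} \sum_{i=1}^{n} x_i$, and a bootstrap
replicate will be $s(x^{*b}) = \frac{1}{n} \sum_{i=1}^{n} x_i^{*b}$.

Then, the bootstrap estimate of the standard deviation of $\hat{\theta}$, denoted by
$\hat{\sigma}_{boot}(\hat{\theta})$, is given by

$$ \hat{\sigma}_{boot}(\hat{\theta}) = \left[ \frac{1}{B - 1} \sum_{b=1}^{B} \left( \hat{\theta}^{*b} - \hat{\theta}^{*}(\cdot) \right)^2 \right]^{1/2} $$

and

$$ \hat{\theta}^{*}(\cdot) = \frac{1}{B} \sum_{b=1}^{B} \hat{\theta}^{*b} . $$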

It is computed by replacing the unknown distribution function $T$ with the empirical
distribution $\hat{T}$. In conjunction with these re-sampling procedures, hypothesis
tests and confidence regions for statistics of interest can be constructed.
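
The bootstrap estimate of a standard deviation can be sketched in a few lines of Python. This is an illustrative sketch only: the function name `bootstrap_std`, its arguments, the simulated sample and the use of the $B - 1$ normalization (`ddof=1`) are our assumptions.

```python
import numpy as np

def bootstrap_std(x, statistic, B=200, seed=0):
    """Bootstrap estimate of the standard deviation of statistic(x).

    Returns the estimate and the B bootstrap replicates.
    """
    rng = np.random.default_rng(seed)
    n = len(x)
    replicates = np.empty(B)
    for b in range(B):
        # Uniform drawing with repetitions of n indices in {0, ..., n - 1}.
        idx = rng.integers(0, n, size=n)
        replicates[b] = statistic(x[idx])
    # Standard deviation of the replicates, normalized by B - 1 (assumption).
    return replicates.std(ddof=1), replicates

# Hypothetical sample; the statistic is the sample mean.
x = np.random.default_rng(42).normal(loc=1.0, scale=2.0, size=100)
sigma_boot, reps = bootstrap_std(x, np.mean)
print(f"bootstrap std of the mean: {sigma_boot:.4f}")
print(f"classical s / sqrt(n):     {x.std(ddof=1) / np.sqrt(len(x)):.4f}")
```

For the sample mean, the bootstrap value can be compared with the classical estimate $s / \sqrt{n}$; the two should be close, which gives a simple sanity check of the re-sampling code.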

In the following, the method we propose as a tool to select an MLP model is similar to
the bootstrap method, since it relies on re-sampling techniques, but it is non-parametric.

3 Bootstrap applied to model selection for MLPs

Let $\mathcal{B}_0$ be a data set of size $n$,

$$ \mathcal{B}_0 = \{(x_1; y_1), \ldots, (x_n; y_n)\}, $$

where $x_i$ is the $i$-th value of a $p$-vector of explanatory variables and $y_i$ is the
response to $x_i$, $1 \le i \le n$. From the original data set $\mathcal{B}_0$ (called the
initial base), one generates $B$ bootstrapped bases $\mathcal{B}_b^*$, $1 \le b \le B$
(i.e. $B$ uniform drawings of $n$ data points in $\mathcal{B}_0$ with repetitions).
For any generated data set $\mathcal{B}_b^*$, an estimator of the MLP parameter vector
$\theta$, denoted by $\hat{\theta}^{*b}$, is found by applying the backpropagation
algorithm [9], for example, but any minimization algorithm can be used. So the bootstrap
procedure provides $B$ replications $\hat{\theta}^{*b}$ for model (1).

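Then we use $\mathcal{B}_0$ as a test base, and evaluate for each $b = 1, \ldots, B$ and
each $i = 1, \ldots, n$ the residual estimate:

$$ \hat{\varepsilon}_i^{*b} = y_i - y(x_i; \hat{\theta}^{*b}) . $$

The study of the histograms of these estimated residuals makes it possible to evaluate
the distribution of the error term $\varepsilon$, to check its whiteness, etc. For each
bootstrapped sample $\mathcal{B}_b^*$, $b = 1, \ldots, B$ (that is, for each
$\hat{\theta}^{*b}$), the sum of squares of the residuals on the test base
$\mathcal{B}_0$ is computed:

$$ TSSE(b) = \sum_{i=1}^{n} \left( \hat{\varepsilon}_i^{*b} \right)^2 , $$

as well as the mean of the squares of the residuals on the test base $\mathcal{B}_0$:

$$ TMSE(b) = \frac{1}{n} \sum_{i=1}^{n} \left( \hat{\varepsilon}_i^{*b} \right)^2 . $$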

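So, we get a vector TMSE whose mean value is:

$$ \mu_{boot} = \frac{1}{B} \sum_{b=1}^{B} TMSE(b) \tag{4} $$

and whose standard deviation is:

$$ \sigma_{boot} = \left[ \frac{1}{B - 1} \sum_{b=1}^{B} \left( TMSE(b) - \mu_{boot} \right)^2 \right]^{1/2} \tag{5} $$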



These two values measure the residual variance of the model, estimated from the
bootstrapped samples, and the stability of the parameter vector estimates. So this
technique makes it possible to evaluate a model from only one sample (without splitting
it into a training base and a test base, which would decrease the number of data used
for the estimation).



1. Generate $B$ samples of size $n$ by random drawings with repetitions in
the initial base $\mathcal{B}_0 = \{(x_1, y_1), \ldots, (x_n, y_n)\}$. Let us denote by
$\mathcal{B}_b^* = \{(x_1^{*b}, y_1^{*b}), \ldots, (x_n^{*b}, y_n^{*b})\}$ the $b$-th
bootstrapped sample, $b = 1, \ldots, B$.

2. For each bootstrapped sample, $b = 1, \ldots, B$, estimate $\theta$ by minimizing
$\sum_{i=1}^{n} [y_i^{*b} - y(x_i^{*b}; \theta)]^2$; we get $\hat{\theta}^{*b}$.

3. The bootstrap standard deviation is given by:

$$ \sigma_{boot} = \left[ \frac{1}{B - 1} \sum_{b=1}^{B} \left( TMSE(b) - \mu_{boot} \right)^2 \right]^{1/2} $$

where

$$ \mu_{boot} = \frac{1}{B} \sum_{b=1}^{B} TMSE(b) . $$

Table 1: Re-sampling algorithm (bootstrap procedure) used to compute $\mu_{boot}$ and
$\sigma_{boot}$ (typically $20 \le B \le 200$).
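
The re-sampling procedure of Table 1 can be sketched as follows. This is a hedged illustration rather than the authors' implementation: a linear least-squares fit replaces MLP training so that the example runs quickly, and the names `fit_linear`, `predict_linear` and `bootstrap_tmse`, the simulated data and the $B - 1$ normalization are assumptions. Any pair of fit and predict functions, for instance an MLP trained by backpropagation, could be plugged in instead.

```python
import numpy as np

def fit_linear(X, y):
    """Stand-in for MLP training: least-squares estimate of theta."""
    theta, *_ = np.linalg.lstsq(X, y, rcond=None)
    return theta

def predict_linear(theta, X):
    return X @ theta

def bootstrap_tmse(X, y, fit, predict, B=50, seed=0):
    """Return mu_boot, sigma_boot and the vector TMSE(1), ..., TMSE(B)."""
    rng = np.random.default_rng(seed)
    n = len(y)
    tmse = np.empty(B)
    for b in range(B):
        idx = rng.integers(0, n, size=n)     # step 1: bootstrapped base
        theta_b = fit(X[idx], y[idx])        # step 2: estimate on that base
        resid = y - predict(theta_b, X)      # residuals on the initial base
        tmse[b] = np.mean(resid ** 2)        # TMSE(b)
    return tmse.mean(), tmse.std(ddof=1), tmse   # step 3

# Hypothetical data and two candidate models (with and without x2).
rng = np.random.default_rng(1)
n = 500
x = rng.normal(size=(n, 2))
y = 2.0 + 0.7 * x[:, 0] + 0.5 * x[:, 1] + rng.normal(size=n)
X_full = np.column_stack([np.ones(n), x])
X_reduced = X_full[:, :2]

for name, X in [("x1 and x2", X_full), ("x1 only", X_reduced)]:
    mu, sigma, _ = bootstrap_tmse(X, y, fit_linear, predict_linear)
    print(f"{name}: mu_boot = {mu:.4f}, sigma_boot = {sigma:.4f}")
```

Passing the fit and predict functions as arguments keeps the re-sampling loop independent of the model family, so the same loop can be reused for each candidate architecture.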



To choose between several architectures $M_1, M_2, \ldots$, these computations are
repeated for each of them, and the best one is the one that offers the best compromise
(the ideal would be to simultaneously minimize $\mu_{boot}$ and $\sigma_{boot}$). The
approach is summarized in Table 1.

Two main disadvantages must be pointed out:

• the computer simulation time: if n or p is high, computation time can be very long
even with second-order optimization techniques such as BFGS, but it still remains less
than the computing time for empirical exploration

• the repetition of extreme data: there is a risk of selecting a re-sampled data set
for which iterative methods will converge with difficulty. But ignoring these
repetitions could introduce a bias.

Many other re-sampling procedures have been proposed in the statistical literature:
cross-validation, jackknife, leave-one-out, etc. See Hamamoto [7] and Borowiak [1]
for details.

4 Examples 

We illustrate the bootstrap method on two examples with simulated data. The
third example is an application of our method to a real data set. For each example, we
build B = 50 bootstrapped samples, and three models with different architectures are
compared in order to choose the best one.

A comparison is made with the leave-one-out method, which is also based on data
base replication, but in a different way. We use a uniform distribution on the
original data to choose the observation that is left out. Hence, we train the MLP on
$B = 50$ data base replications with $n - 1$ observations, and we compute the values
$TMSE(b)$ using the observation that we left out as a test base. We use the same $B$ for
both methods in order to compare them with the same number of replications. We get a
vector TMSE and compute its mean $\mu_{loo}$ and its standard deviation $\sigma_{loo}$,
as before.
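
The leave-one-out variant described above can be sketched in the same way. Again this is only a sketch under assumptions: a linear least-squares fit stands in for MLP training, and the names `fit_linear`, `predict_linear` and `leave_one_out_tmse`, as well as the simulated data, are hypothetical.

```python
import numpy as np

def fit_linear(X, y):
    """Stand-in for MLP training: least-squares estimate of theta."""
    theta, *_ = np.linalg.lstsq(X, y, rcond=None)
    return theta

def predict_linear(theta, X):
    return X @ theta

def leave_one_out_tmse(X, y, fit, predict, B=50, seed=0):
    """Return mu_loo, sigma_loo and the vector TMSE(1), ..., TMSE(B)."""
    rng = np.random.default_rng(seed)
    n = len(y)
    tmse = np.empty(B)
    for b in range(B):
        i = rng.integers(0, n)              # observation left out, drawn uniformly
        keep = np.arange(n) != i
        theta_b = fit(X[keep], y[keep])     # training on the n - 1 other points
        pred = predict(theta_b, X[i:i + 1])[0]
        tmse[b] = (y[i] - pred) ** 2        # test base = the left-out point
    return tmse.mean(), tmse.std(ddof=1), tmse

# Hypothetical data.
rng = np.random.default_rng(1)
n = 500
x = rng.normal(size=(n, 2))
y = 2.0 + 0.7 * x[:, 0] + 0.5 * x[:, 1] + rng.normal(size=n)
X = np.column_stack([np.ones(n), x])

mu_loo, sigma_loo, tmse_loo = leave_one_out_tmse(X, y, fit_linear, predict_linear)
print(f"mu_loo = {mu_loo:.4f}, sigma_loo = {sigma_loo:.4f}")
```

In this scheme each $TMSE(b)$ is a single squared residual, so its spread is large: for Gaussian residuals with variance $\sigma^2$, a squared residual has standard deviation $\sqrt{2}\,\sigma^2$.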

4.1 Example 1 : Linear model 

Consider the problem of fitting a linear model : 

$$ y = \theta_0 + \theta_1 x_1 + \theta_2 x_2 + \ldots + \theta_p x_p + \varepsilon . $$
We simulate a data set $\mathcal{B}_0 = \{(x_1^{(i)}, x_2^{(i)}, y_i)\}$, $i = 1, \ldots, 500$, by putting:

$x_1^{(i)} = i$, $x_2^{(i)}$ =iK, $y_i = 2 + 0.7\,x_1^{(i)} + 0.5\,x_2^{(i)} + \varepsilon_i$

where $\varepsilon_i$ is a random variable with distribution $\mathcal{N}(0, 4)$ (4 is
the variance). We consider three models:

• Model $M_1$: $p = 2$, $y = \theta_0 + \theta_1 x_1 + \theta_2 x_2 + \varepsilon$: true model

• Model $M_2$: $p = 1$, $y = \theta_0 + \theta_1 x_1 + \varepsilon$

• Model $M_3$: $p = 3$, $y = \theta_0 + \theta_1 x_1 + \theta_2 x_2 + \theta_3 x_3 + \varepsilon$, with ar^ = »f and 6 3 = 1

We compute $\mu_{boot}(M_i)$, $\mu_{loo}(M_i)$ (Eq. 4), $\sigma_{boot}(M_i)$ and
$\sigma_{loo}(M_i)$ (Eq. 5) for each model; the results are in Table 2. With the
bootstrap method, we see that the best model is model $M_1$, i.e. the true model. With
the leave-one-out method we cannot conclude, because there are no significant
differences between the three values of $\mu_{loo}$ or of $\sigma_{loo}$. Notice that
the mean $\mu_{loo}$ is over-estimated and that $\sigma_{loo}$ is of an order 10 times
greater than $\sigma_{boot}$.



4.2 Example 2 : Non-linear modeling with simulated data 

We use Eq. 1 with the sigmoid transfer function $\phi$ to simulate a data set:

$$ \mathcal{B}_0 = \{(x_1^{(i)}, x_2^{(i)}, y_i)\}, \quad i = 1, \ldots, 500, $$

by computing $y_i$ as a noisy output of a multilayer perceptron, defined by:
$p = 2$ input variables,

$x_1 \sim \mathcal{N}(0.2, 4)$,

$x_2 \sim \mathcal{N}(-0.1, 0.25)$,

one hidden layer with 4 hidden neurons,
$\theta = (0.5, -0.1, 0.2, 0.5, -0.4, 0.2, 0.1, 3, 0.3, 2, 0.5, 0.1, 0.2, 2, 0.2, 3, 0.1)$, as
defined in section 1,

$\varepsilon$ has distribution $\mathcal{N}(0, 0.04)$.

We consider three models : 

• Model $M_2$: two inputs, one hidden layer with 2 hidden neurons

• Model $M_4$: two inputs, one hidden layer with 4 hidden neurons: true model

• Model $M_6$: two inputs, one hidden layer with 6 hidden neurons

We compute $\mu_{boot}(M_i)$, $\mu_{loo}(M_i)$ (Eq. 4), $\sigma_{boot}(M_i)$ and
$\sigma_{loo}(M_i)$ (Eq. 5) for each model. Table 2 shows the results. The bootstrap
method shows that the best model is model $M_2$. It is not the true model, but it is
the best. This is not so surprising, since multilayer perceptrons are always
over-parametrized and the multilayer perceptron function which can model a given
function is not unique. With the leave-one-out method, we cannot conclude, because it
eliminates the true model and does not separate the first and the third models.

4.3 Example 3 : Non-linear model with real data

In this section, we study a real data set to assess the efficiency of the model
selection method that we propose.

The power peak control in the core of nuclear reactors is explored. The problem has
already been studied in the past, notably by Gaudier [6], who constructed a neural
model with 22 input variables and 2 hidden layers (the first one with 26 neurons, the
other with 40 neurons). The model accounts for the physical localization of uranium
bars and diffusion processes, and was designed to reproduce the classical computation
code while reducing computing time.

• Model $M_{40}$: 22 inputs, two hidden layers with respectively 26 and 40 hidden
neurons

• Model $M_{35}$: 22 inputs, two hidden layers with respectively 26 and 35 hidden
neurons

• Model $M_{30}$: 22 inputs, two hidden layers with respectively 26 and 30 hidden
neurons

For each model, we compute $\mu_{boot}(M_i)$, $\mu_{loo}(M_i)$ (Eq. 4),
$\sigma_{boot}(M_i)$ and $\sigma_{loo}(M_i)$ (Eq. 5).

The bootstrap method (Table 2) shows that model $M_{30}$ seems to be the best (its
residual variance is the smallest for a similar value of Hboot)- The leave-one-out method 
confirms our conclusion in this case. But Uboot « &ioo for each model, which is 
important to ensure the stability of the model. In that case, it would be necessary to 
study other architectures different from the three that we have considered. 



                        Bootstrap                       Leave-one-out¹
         Model      $\mu_{boot}$   $\sigma_{boot}$   $\mu_{loo}$   $\sigma_{loo}$
Exp 1    $M_1$      3.9525         0.0155            4.76268       6.49886
         $M_2$      3.9020         0.5985            4.81903       6.54536
         $M_3$      3.9475         0.4259            4.73803       6.54557
Exp 2    $M_2$      0.04277        0.00019           0.04999       0.06807
         $M_4$      0.04271        0.00029           0.05303       0.07553
         $M_6$      0.04277        0.00028           0.04895       0.06772
Exp 3    $M_{30}$   0.0473         0.0052            0.03961       0.05347
         $M_{35}$   0.0599         0.0069            0.05132       0.07873
         $M_{40}$   0.0492         0.0049            0.04763       0.08161

¹ We use 50 data base replications for every training.

Table 2: Summary table: comparison of the results of the bootstrap method and the
leave-one-out method.

We remark that in all cases $\sigma_{boot} \ll \sigma_{loo}$, so the estimation of the
variance of the model is much more precise with the bootstrap method than with the
leave-one-out method.

5 Conclusion 

These examples indicate that our technique is better than the leave-one-out method.
The bootstrap method can be used in a great variety of situations. We have applied
it to many other cases, and the results seem to be of real help for model
selection.